Abstract

This paper describes a probabilistic top-down parser for minimalist grammars. Top-down parsers
have the great advantage of predictive power during parsing, which proceeds by a left-to-right
reading of the sentence. Such parsers have already been implemented and studied for
Context-Free Grammars (see for example [Roa01]), which generate top-down, but
they are difficult to adapt to Minimalist Grammars, which generate sentences bottom-up. I propose
here a way of rewriting Minimalist Grammars as Linear Context-Free Rewriting Systems, which makes
it easy to build a top-down parser. This rewriting also makes it possible to put a probability field
on these grammars, which can be used to speed up the parser. I also propose a method for refining
the probability field using algorithms from data compression.
Throughout this paper, a subtree of a tree T is the set of nodes of T dominated by
a particular node, which is the root of the subtree. A cut, on the other hand, is the set of
leaves of a finite prefix tree of T.



1 Introduction 

The idea of this parser is to see a minimalist grammar (MG) as a linear context-free rewriting system
(LCFRS) on its derivation trees. This transformation allows us to work on a grammar without
movement, which generates sentences from top to bottom (unlike an MG, which generates sentences
bottom-up), and to put a probability field on it.

1.1 Minimalist grammars 

Minimalist grammars are designed to generate (subparts of) human natural languages. They are
framed in Chomsky's minimalist program [Cho95], and were first described by E. Stabler in [Sta97].
For the sake of clarity, I will use a slightly different convention in this paper to represent the trees
generated by a minimalist grammar.

Minimalist grammars distinguish themselves from more classical context-free grammars by the
fact that they allow movement, commonly required by syntacticians to generate sentences such as
'Which mouse did the cat eat', where 'which mouse' is base-generated at the end of the
sentence (in the object position) and moves to the front. The tree corresponding to this sentence, as
generated by the toy Minimalist Grammar we will use as an example, is the following:

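(1) [Derived tree of 'which mouse did the cat eat'. The root is labelled > with feature list
    =v +wh • c. Its left daughter is the moved constituent 'which mouse'; its right daughter is the
    constituent headed by did : • =v +wh c, which contains 'the cat', eat, and the trace t0 in
    object position.]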

where t0 denotes the trace of the subtree 'which mouse', which has moved to the front of the
sentence. A trace is kept for psychological reasons, since traces can be shown to still be
present when the meaning of a sentence is computed. Traces also make it possible to keep what is
called the deep structure, corresponding to the tree where no movement has happened and all
constituents are in their base position, where lexical selection takes place.
A minimalist grammar takes several lexical elements, and builds a tree with them. The toy grammar
we consider will have the following lexical items:

(2) • mouse :: n
    • cat :: n
    • the :: =n d
    • which :: =n d -wh
    • ate :: =d =d c
    • eat :: =d =d v
    • did :: =v +wh c
    • did :: =v c

This grammar generates (roughly) all affirmative/interrogative past sentences about a cat and a mouse 
eating each other. 

As can be seen, many symbols are used next to the actual phonetic contents of the words (the,
eat, cat, etc.). These are syntactic features. The sequence of features in a lexical item represents
its syntactic category, and is all that is needed to compute the tree. Two lexical items with the same
syntactic category can be freely interchanged without losing grammaticality.

The syntactic features may be of four types: 

- categories, represented by a string of letters, among which is the distinguished feature c, used to
recognise the grammatical outputs. For example, n. The set of categories will be noted Cat.

- selectors, represented by the string of letters of a category, preceded by a =. For example, =n.
The set of selectors will be noted Sel.

- licensees, represented by a string of letters preceded by a -. For example, -wh. The set of licensees
will be noted Licensee.

- licensors, represented by the string of letters of a licensee, preceded by a +. For
example, +wh. The set of licensors will be noted Licensor.

Syntactic features must follow a certain order, to ensure that trees are well formed:
Syn = (Sel (Sel ∪ Licensor)*)? Cat Licensee*
that is, an optional run of selectors and licensors beginning with a selector (absent in items such
as mouse :: n), then exactly one category, then any number of licensees.
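The ordering constraint on feature strings can be checked mechanically. The following is only a minimal sketch, assuming that features are written as plain strings (`=n`, `+wh`, `-wh`, `n`) and that the selector prefix is optional; the names `kind`, `SYN` and `well_formed` are hypothetical helpers introduced for illustration.

```python
import re


def kind(feature: str) -> str:
    """Map a feature to a one-letter code.

    s = selector (=x), p = licensor (+f), m = licensee (-f), c = category.
    """
    return {"=": "s", "+": "p", "-": "m"}.get(feature[0], "c")


# Optional prefix (a selector followed by selectors/licensors),
# then exactly one category, then any number of licensees.
SYN = re.compile(r"(s[sp]*)?cm*")


def well_formed(features) -> bool:
    """Return True if the feature sequence respects the order required by Syn."""
    return SYN.fullmatch("".join(kind(f) for f in features)) is not None
```

For instance, `well_formed(["=n", "d", "-wh"])` accepts the feature string of which, while a string beginning with a licensor is rejected.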

The trees are computed by using two functions on the lexical items, to form constituents (i.e.,
trees):

- merge, when a selector selects a corresponding category,

- move, when a licensee moves to a corresponding licensor.

Let's see how this works on our little tree:

• Take which : • =n d -wh and mouse : • n. We added a • in front of the syntactic categories to
keep track of the derivation. Here, the two features just right of the dot (the current features)
are =n and n. These are a selector and its corresponding category, so we can merge them into a
bigger constituent (daughters are indented under their mother):

    <   =n • d -wh
        which : • =n d -wh
        mouse : • n
Both of the syntactic categories are copied to the root of the new constituent, with their dots
moved one step right, since the current features were used. The category with the selector always
comes first. If, as is the case for the syntactic category of mouse, the dot ends up at the far
right, the category may be left out, since it will play no role in further derivations. The
< indicates the head of the constituent, i.e. the constituent the selector came from.

• Then merging the new constituent, whose syntactic category is =n • d -wh, with eat : • =d =d v
(note the selector =d, corresponding to the current d) gives:

    <   =d • =d v, =n d • -wh
        eat : • =d =d v
        <   =n • d -wh
            which : • =n d -wh
            mouse : • n
Note also that here, the second syntactic category still has something right of the dot, so it stays.

• Together with the constituent

    <   =n • d
        the : • =n d
        cat : • n

it merges into:

    >   =d =d • v, =n d • -wh
        <   =n • d
            the : • =n d
            cat : • n
        <   =d • =d v, =n d • -wh
            eat : • =d =d v
            <   =n • d -wh
                which : • =n d -wh
                mouse : • n

Note that for merging, only the first category in the list of syntactic categories is considered.
Here, in =d • =d v, =n d • -wh, only =d • =d v is considered (in fact, only • =d).
Note also that if the constituent with the selector is complex (i.e. is not formed of a single lexical
item), as is the case here, the merging happens the other way round: head right and selected
constituent left. This has to do with the fact that English is an SVO language.
• It can then merge with did : • =v +wh c, giving:

    <   =v • +wh c, =n d • -wh
        did : • =v +wh c
        >   =d =d • v, =n d • -wh
            <   =n • d
                the : • =n d
                cat : • n
            <   =d • =d v, =n d • -wh
                eat : • =d =d v
                <   =n • d -wh
                    which : • =n d -wh
                    mouse : • n

• Finally, we can apply move. This function applies to a single constituent whose first syntactic
category has a licensor right of the dot (here, +wh). It scans the other categories to
find a corresponding licensee right of a dot (here, -wh), and moves the corresponding constituent
to the top of the tree, leaving a trace t0 behind and giving the final sentence:

    >   =v +wh • c
        <   =n • d -wh
            which : • =n d -wh
            mouse : • n
        <   =v • +wh c, =n d • -wh
            did : • =v +wh c
            >   =d =d • v, =n d • -wh
                <   =n • d
                    the : • =n d
                    cat : • n
                <   =d • =d v, =n d • -wh
                    eat : • =d =d v
                    t0

The dots are moved as usual, and the > indicates the constituent where the licensor was. In
this case, since the only feature right of a dot is the distinguished feature c, we know that the
derivation yielded a grammatical output.
The moved constituent is the biggest constituent whose head is the lexical element
containing the licensee under consideration (this corresponds to the syntactic notion of maximal
projection).

The phonetic content of this sentence is the concatenation of the phonetic contents of its leaves,
read from left to right: 'which mouse did the cat eat'.

The notion of head is important in linguistics since, by the principle of locality of
selection, we want to restrict the amount of information that an item has access to. As such, the only
information about a constituent accessible from outside (for a merge operation, for example) is the
right-of-the-dot features of its head, that is, the features of the first syntactic category (minus the
left-of-the-dot ones, which are kept only for bookkeeping).

1.2 Derivation trees 

Another way of representing the constituents generated by a grammar is by using its derivation tree : 

Definition 1.1. The derivation tree of a constituent is a binary tree showing the history of its building 
by the functions merge and move. Its leaves are lexical items, and its nodes are labelled by either • 
(merge, it's then a binary node) or o (move, it's then a unary node). 

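For example, the derivation tree of the previous example (1), written as a term with the selecting
(or licensing) constituent as first argument, is:

(3) move(merge(did :: =v +wh c,
               merge(merge(eat :: =d =d v,
                           merge(which :: =n d -wh, mouse :: n)),
                     merge(the :: =n d, cat :: n))))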

Let's note that to each subtree of a derivation tree corresponds a unique constituent, appearing in 
the construction of the main one. We can then label each node of a derivation tree by the syntactic 
category of the corresponding constituent. 
2 Probabilistic minimalist grammars



2.1 Derivation trees of MG as LCFRS-derived trees 

The basis of this method is to see minimalist derivation trees as trees generated by linear context-
free rewriting systems (LCFRS). Putting a probability field on these is indeed very easy. A similar
approach was also used by H. Harkema for the minimalist parser of his thesis [Har01].
We thus take a general minimalist grammar $Q = (\sigma, \mathit{Feat}, \mathit{Lex}, \mathcal{F})$.

The closure of $\mathit{Lex}$ under $\mathcal{F}$ gives the outputs of the grammar. Here, we will not consider these
outputs, but the derivation trees describing the process that yields them.

One important difference between context-free grammars, for which probabilistic versions are well
studied, and minimalist grammars is that a CFG generates trees from top to bottom, by means of rules
rewriting each non-terminal node into a number of other nodes, while an MG generates trees from the
bottom up, by merging and moving elements. In the CFG case, we begin with a single symbol
and then choose rules to rewrite it (which lets us assign probabilities to the process by assigning
probabilities to the rewriting rules). In an MG, we begin with a collection of lexical items, not necessarily
compatible with each other, and merge them together (occasionally moving them too). Here we
present a way of seeing the generating process of an MG as an LCFRS, which, like a CFG, generates from
top to bottom with a set of rules. The different non-terminal symbols will be defined by the closure of
a certain set of axioms (starting symbols) under a set of inference rules, which gives a top-down
way of generating the derivation trees of an MG.

2.1.1 Categories and partial outputs 

In order to do this, we first define a particular type of object, called categories, which will be the
non-terminal symbols of our LCFRS:

Definition 2.1. A category is either a lexical item, or a sequence of the form
$[\gamma_0 \bullet \delta_0, \ldots, \gamma_k \bullet \delta_k]$ where $\gamma_0, \ldots, \gamma_k, \delta_0, \ldots, \delta_k$
are elements of Syn, or a special symbol start. A simple category is a category
with its first dot at the leftmost place (and $k = 0$). Otherwise, it is a complex category. The symbol
start is neither simple nor complex.

Categories correspond exactly to the lists of syntactic categories defined in 1.1, although our
definition here also allows categories which cannot be generated by a Minimalist Grammar. We will of
course only be interested in those which can.

We then define a partial output as a string $A_1 \ldots A_n$ of categories. These represent a particular
stage in the construction of a minimalist derivation tree by the corresponding LCFRS, the different
categories being the categories of the partial derivation tree which is being built.

2.1.2 Axiom 

There is a single axiom, the category start. 

2.1.3 Inference rules 

These rules correspond to the rewriting rules of the Linear Context-Free Rewriting System S
corresponding to our Minimalist Grammar Q. For each possible application of one of the functions merge
or move of grammar Q giving a particular category A, there is a corresponding inference rule (which
gives quite a lot of rules). Then, given a particular category A, the rules tell how this particular
type of tree (remember that categories describe a particular type of tree generated by the grammar)
can be un-merged or un-moved into one (in the case of un-move) or two (in the case of un-merge) different
types of trees. To these must be added the rules expanding the start category, and the lexicalisation
rules. The former allow us to begin with a unique symbol, instead of all categories ending with the
distinguished feature. The latter allow us to leave the lexical part of the parsing to the last
moment. The rules are thus:

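1. Start rules: for every lexical item $\gamma :: \delta\, c$,
   Start:
       $\text{start} \to [\delta \bullet c]$

2. Re-writing rules for complex categories:

   (a) Un-merge rules: the left-hand category is of the form $[\delta\, {=}x \bullet \beta, S]$

       i. Cases where the selector was a simple tree ($\delta = \epsilon$):

          A. For any lexical item of feature string $\gamma\, x$,
             Unmerge-1:
                 $[{=}x \bullet \beta, S] \to [\bullet\, {=}x\, \beta]\ [\gamma \bullet x, S]$

          B. For any element $(\gamma\, x \bullet \varphi) \in S$, with $S' = S - (\gamma\, x \bullet \varphi)$,
             Unmerge-3, simple:
                 $[{=}x \bullet \beta, \gamma\, x \bullet \varphi, S'] \to [\bullet\, {=}x\, \beta]\ [\gamma \bullet x\, \varphi, S']$
             It should be noted that necessarily $\varphi \neq \epsilon$.

       ii. Cases where the selector was a complex tree:

          A. For any decomposition $S = U \cup V$, and any lexical item of feature string $\gamma\, x$,
             Unmerge-2:
                 $[\delta\, {=}x \bullet \beta, S] \to [\delta \bullet {=}x\, \beta, U]\ [\gamma \bullet x, V]$

          B. For any element $(\gamma\, x \bullet \varphi) \in S$, and any decomposition
             $S = U \cup V \cup \{\gamma\, x \bullet \varphi\}$ (so that $S' = U \cup V$),
             Unmerge-3, complex:
                 $[\delta\, {=}x \bullet \beta, \gamma\, x \bullet \varphi, S'] \to [\delta \bullet {=}x\, \beta, U]\ [\gamma \bullet x\, \varphi, V]$
             As in 2(a)iB, $\varphi$ has to be non-empty.

   (b) Un-move rules: the left-hand category is of the form $[\delta\, {+}f \bullet \beta, S]$

       i. For any $(\gamma\, {-}f \bullet \varphi) \in S$ (necessarily unique by the Shortest Movement
          Constraint), with $S' = S - (\gamma\, {-}f \bullet \varphi)$,
          Unmove-2:
              $[\delta\, {+}f \bullet \beta, \gamma\, {-}f \bullet \varphi, S'] \to [\delta \bullet {+}f\, \beta, \gamma \bullet {-}f\, \varphi, S']$

       ii. If there is no $(\gamma\, {-}f \bullet \varphi) \in S$, then for any lexical item of feature string $\gamma\, {-}f$,
          Unmove-1:
              $[\delta\, {+}f \bullet \beta, S] \to [\delta \bullet {+}f\, \beta, \gamma \bullet {-}f, S]$

3. Re-writing rules for simple categories: for any lexical item $\lambda :: \beta$,
   Lexicalize:
       $[\bullet\, \beta] \to \lambda :: \beta$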
The set of relevant partial outputs can thus be defined as the closure of the axiom start under
the inference rules. This set describes exactly all possible partial outputs given by the LCFRS S, i.e.
all possible strings of categories obtained by a cut through a tree generated by the LCFRS S. Such
a string corresponds to a selection of outputs (not necessarily complete) generated by the minimalist
grammar Q, such that they can be put together by application of merge and move, in the same
order (two categories get merged only if they are adjacent in the string), to obtain a complete
output. A relevant output is a relevant partial output consisting only of lexical items. It corresponds
to a grammatical sentence.

The relevant categories are exactly the categories that appear in a relevant partial output. They
correspond to the possible sets of similar partial trees generated by the grammar Q. There are finitely
many of them since, by the Shortest Movement Constraint, no two identical licensees can appear in the
feature strings of a relevant category (omitting the first string). Thus two identical feature strings
(differing only in the position of the dot) cannot appear together, and therefore the total length of all
the feature strings of a relevant category is bounded by the sum of the lengths of all the feature strings
of the lexical items, which is finite.

2.1.4 Derivation trees 

With this formalism, we now have a quite straightforward way of defining minimalist derivation trees,
one that lets us put probabilities on them very simply: they are just the trees obtained
by maximal application of rewriting rules to the axiom start. The probability is simply given by a
probability field on the rules.

2.2 Probabilities on MG derivation trees 

To define a probability field on the derivation trees of an MG, we now just have to put conditional
probabilities on the rules discussed before, given the initial relevant category. The probability of a
given tree is then the product of the probabilities of the rules that generate it, as for regular
probabilistic linear context-free rewriting systems. There can however be quite a lot of such rules and
relevant categories, even if the MG is quite simple, but they can all be computed beforehand from
the grammar alone, thanks to the definition of these categories by closure. Indeed, we
will see a simple method for computing both the relevant categories and the inference rules
that are needed.

It should be noted that the functions (Merge-1,2,3 and Move-1,2) that may have given rise
to a given relevant category are quite few (at most two), but only if we use the dot notation, which keeps
track of a minimal part of the history of the derivation. This is why the relevant categories should
include all features of the lexical item potentially heading the tree (and not just the ones to the right
of the dot).

To settle things a bit, we will here illustrate this method with a little example. 
2.3 Example: a^n b^n

We will here consider the MG with the following lexical items (ε being the empty string):

(4) • ε :: c
    • ε :: =a +m c
    • a :: =b a -m
    • b :: b
    • b :: =a +m b

This grammar generates exactly the strings of the form a^n b^n, n ∈ N. Since this is a context-free
language, we would not have needed licensors and licensees, but we will work on this grammar in order
to have a language that is simple and yet has enough rules (especially movement ones).

We now want to get the relevant categories of this language, and the corresponding 'context-free
rules'. A quite straightforward way to obtain them is to start from the axiom start and follow the
inference rules to close the set of relevant categories. From start we apply the schemes to get all
applicable rules, apply them, get some new relevant categories, apply the schemes to get new rules,
apply them, and so on. Since there are finitely many relevant categories, this algorithm eventually
terminates, giving us all the relevant categories and all the rules we need (we will not get every
instance of the schemes, since the schemes could also apply to non-relevant categories, but we do
not want those in any case).

So here we go: 

• Starting rules: we search for all lexical items whose feature string ends with c. There are two here,
ε :: c and ε :: =a +m c, so we have two rules:

    Start: start → [• c]
    Start: start → [=a +m • c]

We now have two new relevant categories: [• c] and [=a +m • c]. We will now write the rules
with these on the left side of the arrow.

• [• c] corresponds to case 3. There is only one lexical item with features c, which is ε :: c, so we have
a single rule:

    Lexicalize: [• c] → ε :: c

No new relevant category is created, so we can move to the next one.

• [=a +m • c] corresponds to case 2(b), so there are two possibilities. Since 'S' is empty,
only case 2(b)ii can apply. We must then look for lexical items whose last feature is -m.
There is only one (and thus only one corresponding relevant category), a :: =b a -m. So we
have one possible rule:

    Unmove-1: [=a +m • c] → [=a • +m c, =b a • -m]

We now have a new relevant category, [=a • +m c, =b a • -m].

• [=a • +m c, =b a • -m] corresponds to case 2(a)i. For case 2(a)iA, we have to look for a lexical
item whose last feature is a. Since there is no such item, we fall back to 2(a)iB. Here we have to
look in 'S' for feature strings of type γ a • φ. There is only one, namely =b a • -m, so we have
one rule:

    Unmerge-3, simple: [=a • +m c, =b a • -m] → [• =a +m c] [=b • a -m]

We get two more relevant categories, [• =a +m c] and [=b • a -m].

• [• =a +m c] corresponds to case 3, and there is only one lexical item with the corresponding
features, so we have one additional rule:

    Lexicalize: [• =a +m c] → ε :: =a +m c

• [=b • a -m] corresponds to case 2(a)i. We first try case 2(a)iA. We look for lexical items with
last feature b. There are two such items, namely b :: b and b :: =a +m b. We then have two
rules:

    Unmerge-1: [=b • a -m] → [• =b a -m] [• b]
    Unmerge-1: [=b • a -m] → [• =b a -m] [=a +m • b]

Since 'S' is empty here, case 2(a)iB cannot apply, and we move on to the three newly discovered
relevant categories, [• =b a -m], [• b] and [=a +m • b].

• [• =b a -m] is ready to be lexicalized. There is again only one corresponding lexical item, so we
get the rule:

    Lexicalize: [• =b a -m] → a :: =b a -m

• [• b] is in the same case, so we have:

    Lexicalize: [• b] → b :: b

• [=a +m • b] corresponds to case 2(b)ii, with only one corresponding lexical item, hence the
rule:

    Unmove-1: [=a +m • b] → [=a • +m b, =b a • -m]

• [=a • +m b, =b a • -m] corresponds to case 2(a)iB, and we have one rule:

    Unmerge-3, simple: [=a • +m b, =b a • -m] → [• =a +m b] [=b • a -m]

Since [=b • a -m] has already been treated, we can move to the last untreated relevant category.

• [• =a +m b] is ready to be lexicalized:

    Lexicalize: [• =a +m b] → b :: =a +m b
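The closure computation walked through above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's implementation: a dotted feature string is encoded as a pair (features left of the dot, features right of the dot), a category as a tuple of such pairs with the head first, and only the schemes needed for this grammar are covered (Start, Lexicalize, Unmerge-1, Unmerge-3 simple, Unmove-1). The names `LEX`, `expand` and `closure` are hypothetical.

```python
from collections import deque

# Lexicon (4): (phonetic content, feature string); "" stands for the empty string.
LEX = [("", ("c",)), ("", ("=a", "+m", "c")), ("a", ("=b", "a", "-m")),
       ("b", ("b",)), ("b", ("=a", "+m", "b"))]


def expand(cat, lex):
    """Return the (rule name, right-hand side) pairs applicable to one category.

    Complex selectors (Unmerge-2, Unmerge-3 complex) and Unmove-2 are
    deliberately left out of this sketch.
    """
    if cat == "start":
        return [("Start", [((fs[:-1], fs[-1:]),)]) for _, fs in lex if fs[-1] == "c"]
    (done, todo), movers = cat[0], cat[1:]
    if not done:  # simple category: only lexicalisation applies
        return [("Lexicalize", [(ph, fs)]) for ph, fs in lex if fs == todo and not movers]
    head = (done[:-1], done[-1:] + todo)  # move the head's dot one step back
    last, rules = done[-1], []
    if last[0] == "=" and len(done) == 1:  # simple selector
        x = last[1:]
        for _, fs in lex:  # Unmerge-1 (an item with no selector cannot carry movers)
            if fs[-1] == x and (len(fs) > 1 or not movers):
                rules.append(("Unmerge-1", [(head,), ((fs[:-1], fs[-1:]),) + movers]))
        for i, (d, t) in enumerate(movers):  # Unmerge-3, simple
            if d[-1] == x and t:
                rest = movers[:i] + movers[i + 1:]
                rules.append(("Unmerge-3, simple", [(head,), ((d[:-1], d[-1:] + t),) + rest]))
    elif last[0] == "+":
        f = "-" + last[1:]
        if not any(d[-1] == f for d, _ in movers):  # Unmove-1
            for _, fs in lex:
                if fs[-1] == f:
                    rules.append(("Unmove-1", [(head, (fs[:-1], fs[-1:])) + movers]))
    return rules


def closure(lex):
    """Close {start} under the rule schemes; return (categories, rules)."""
    seen, queue, rules = {"start"}, deque(["start"]), []
    while queue:
        cat = queue.popleft()
        for name, rhs in expand(cat, lex):
            rules.append((cat, name, rhs))
            if name != "Lexicalize":  # lexical items are terminals
                for new in rhs:
                    if new not in seen:
                        seen.add(new)
                        queue.append(new)
    return seen, rules


cats, rules = closure(LEX)
```

On lexicon (4) this sketch finds the same rules as the manual derivation; a full implementation would also need the complex-selector and Unmove-2 schemes.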

We are now ready to give probabilities to these rules, conditioned on the left-hand side. The
assignment here is quite easy: apart from the two cases where two rules are possible (the choice
of start rule, and the category [=b • a -m]), the conditional probability is 1 (there is no choice). For
the two other cases, we can assign any probability λ (respectively μ) to one rule, and give the other
probability 1 − λ (respectively 1 − μ). We can now give the following table:

Left-hand side                Right-hand side                   Rule                 Probability
start                      →  [• c]                             Start                λ
start                      →  [=a +m • c]                       Start                1 − λ
[• c]                      →  ε :: c                            Lexicalize           1
[=a +m • c]                →  [=a • +m c, =b a • -m]            Unmove-1             1
[=a • +m c, =b a • -m]     →  [• =a +m c] [=b • a -m]           Unmerge-3, simple    1
[• =a +m c]                →  ε :: =a +m c                      Lexicalize           1
[=b • a -m]                →  [• =b a -m] [• b]                 Unmerge-1            μ
[=b • a -m]                →  [• =b a -m] [=a +m • b]           Unmerge-1            1 − μ
[• =b a -m]                →  a :: =b a -m                      Lexicalize           1
[• b]                      →  b :: b                            Lexicalize           1
[=a +m • b]                →  [=a • +m b, =b a • -m]            Unmove-1             1
[=a • +m b, =b a • -m]     →  [• =a +m b] [=b • a -m]           Unmerge-3, simple    1
[• =a +m b]                →  b :: =a +m b                      Lexicalize           1

We will now end by giving the probability of a particular derivation tree (daughters are indented
under their mother):

(5) [aabbε : =a +m • c]
        [ε : =a • +m c, aabb : =b a • -m]
            [ε : • =a +m c]
                ε :: =a +m c
            [aabb : =b • a -m]
                [a : • =b a -m]
                    a :: =b a -m
                [abb : =a +m • b]
                    [b : =a • +m b, ab : =b a • -m]
                        [b : • =a +m b]
                            b :: =a +m b
                        [ab : =b • a -m]
                            [a : • =b a -m]
                                a :: =b a -m
                            [b : • b]
                                b :: b

All the rules here have probability 1, except the top one, the choice of the start rule start →
[aabbε : =a +m • c], which has probability 1 − λ, the one from [aabb : =b • a -m], which has probability
1 − μ, and the one from [ab : =b • a -m], which has probability μ. So the complete tree has probability
μ(1 − μ)(1 − λ), and, for example, the subtree headed by [b : =a • +m b, ab : =b a • -m] has probability μ.
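The product of rule probabilities can be checked with a short computation. The sketch below assumes that categories are written as plain strings (with `.` for the dot and `e` for the empty string), that a tree is a pair (label, list of daughters), and that start → [=a +m . c] carries probability 1 − λ as stated above; `rule_probs`, `tree_prob` and the sample values of λ and μ are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import prod


def rule_probs(lam, mu):
    """Conditional rule probabilities of the a^n b^n example, keyed by (lhs, rhs)."""
    one = Fraction(1)
    return {
        ("start", ("[.c]",)): lam,
        ("start", ("[=a +m . c]",)): 1 - lam,
        ("[.c]", ("e :: c",)): one,
        ("[=a +m . c]", ("[=a . +m c, =b a . -m]",)): one,
        ("[=a . +m c, =b a . -m]", ("[. =a +m c]", "[=b . a -m]")): one,
        ("[. =a +m c]", ("e :: =a +m c",)): one,
        ("[=b . a -m]", ("[. =b a -m]", "[.b]")): mu,
        ("[=b . a -m]", ("[. =b a -m]", "[=a +m . b]")): 1 - mu,
        ("[. =b a -m]", ("a :: =b a -m",)): one,
        ("[.b]", ("b :: b",)): one,
        ("[=a +m . b]", ("[=a . +m b, =b a . -m]",)): one,
        ("[=a . +m b, =b a . -m]", ("[. =a +m b]", "[=b . a -m]")): one,
        ("[. =a +m b]", ("b :: =a +m b",)): one,
    }


def tree_prob(tree, probs):
    """Probability of a derivation tree: product of the probabilities of its rules."""
    label, children = tree
    if not children:  # a lexical item: nothing left to rewrite
        return Fraction(1)
    key = (label, tuple(child[0] for child in children))
    return probs[key] * prod(tree_prob(child, probs) for child in children)


def leaf(item):
    return (item, [])


# Derivation tree (5), with the start node added on top.
a_lex = ("[. =b a -m]", [leaf("a :: =b a -m")])
ab = ("[=b . a -m]", [a_lex, ("[.b]", [leaf("b :: b")])])
inner = ("[=a . +m b, =b a . -m]", [("[. =a +m b]", [leaf("b :: =a +m b")]), ab])
abb = ("[=a +m . b]", [inner])
aabb = ("[=b . a -m]", [a_lex, abb])
top = ("start", [("[=a +m . c]", [("[=a . +m c, =b a . -m]",
                                   [("[. =a +m c]", [leaf("e :: =a +m c")]), aabb])])])

probs = rule_probs(Fraction(3, 10), Fraction(2, 5))
total = tree_prob(top, probs)  # (1 - lam) * (1 - mu) * mu
```

Exact fractions are used so that the result can be compared with the closed form without rounding issues.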
2.4 The Cats and Mouses example

Let's now get back to our toy grammar (2) and see how it rewrites:

(6) • mouse :: n
    • cat :: n
    • the :: =n d
    • which :: =n d -wh
    • ate :: =d =d c
    • eat :: =d =d v
    • did :: =v +wh c
    • did :: =v c



The rules are the following:

 #   Left-hand side                  Right-hand side                           Rule
 1   start                        →  [=d =d • c]                               Start
 2   start                        →  [=v +wh • c]                              Start
 3   start                        →  [=v • c]                                  Start
 4   [=d =d • c]                  →  [=n • d] [=d • =d c]                      Unmerge-2
 5   [=d • =d c]                  →  [• =d =d c] [=n • d]                      Unmerge-1
 6   [• =d =d c]                  →  ate :: =d =d c                            Lexicalize
 7   [=n • d]                     →  [• =n d] [• n]                            Unmerge-1
 8   [• =n d]                     →  the :: =n d                               Lexicalize
 9   [• n]                        →  mouse :: n                                Lexicalize
10   [• n]                        →  cat :: n                                  Lexicalize
11   [=v +wh • c]                 →  [=v • +wh c, =n d • -wh]                  Unmove-1
12   [=v • +wh c, =n d • -wh]     →  [• =v +wh c] [=d =d • v, =n d • -wh]      Unmerge-1
13   [• =v +wh c]                 →  did :: =v +wh c                           Lexicalize
14   [=d =d • v, =n d • -wh]      →  [=n • d] [=d • =d v, =n d • -wh]          Unmerge-2
15   [=d =d • v, =n d • -wh]      →  [=n • d -wh] [=d • =d v]                  Unmerge-3, complex
16   [=d • =d v, =n d • -wh]      →  [• =d =d v] [=n • d -wh]                  Unmerge-3, simple
17   [• =d =d v]                  →  eat :: =d =d v                            Lexicalize
18   [=n • d -wh]                 →  [• =n d -wh] [• n]                        Unmerge-1
19   [• =n d -wh]                 →  which :: =n d -wh                         Lexicalize
20   [=d • =d v]                  →  [• =d =d v] [=n • d]                      Unmerge-1
21   [=v • c]                     →  [• =v c] [=d =d • v]                      Unmerge-1
22   [• =v c]                     →  did :: =v c                               Lexicalize
23   [=d =d • v]                  →  [=n • d] [=d • =d v]                      Unmerge-2
3 The probabilistic top-down parser

The parser works on an ordered list of hypotheses, which it expands in turn while parsing
the sentence. Before presenting the algorithm, some definitions are needed:



3.1 Definitions 

One difficulty in working with derivation trees instead of regular derived trees is that the order of
the words cannot be easily deduced (short of redoing the actual derivation). In order to keep track
of the position of a category in the derived tree (so the parser may know in which order to expand
the tree), we introduce position indices, which denote positions in the derived tree, starting from its
root, by a chain of digits (0 when going down left, 1 when going down right). From this perspective we
can also define a successor operator on them, corresponding to a left-to-right sweep of the tree.
Consider the grammar given by the following lexical items:

(7) 1. a :: =x +f c
    2. b :: =y x
    3. c :: y -f

This grammar will generate the derived tree:

(8) [Derived tree: the root > has the moved c as its left daughter (position 0); its right daughter
    (position 1) has a as its left daughter (position 10) and, as its right daughter (position 11),
    the constituent headed by b, which contains the trace of c.]

Corresponding to this tree is the derivation tree:

(9) [=x +f • c]
        [=x • +f c, y • -f]
            [• =x +f c]
            [=y • x, y • -f]
                [• =y x]
                [• y -f]

The parser should try to expand the nodes leading to the first leaf of the derived tree (8), but it is
actually building the derivation tree (9). It should therefore begin by expanding the right-most nodes,
then switch back to the left-most ones once c is parsed in order to parse a, and so on. Position indices,
which record the position each category ends up in, can be computed online and incorporated into the
derivation tree: for example 0/ for all categories corresponding to c, since its final position is just
one branch down and left from the root. The parser then just has to expand the unexpanded nodes with
the lowest (i.e. leftmost) position indices. To do this, it keeps a pointer recording up to which point
nodes have been expanded, and expands the corresponding node. Updating the pointer with the
appropriate notion of successor then keeps the parser working. To formalize this:

Definition 3.1. A position index is an element $\pi \in \{0, 1\}^* \cup \{-1\}$.

Its successor $s(\pi)$ is defined to be:

$$s(\pi) = \begin{cases} \alpha 1 & \text{if } \pi = \alpha 0 \beta,\ \beta \in 1^* \\ -1 & \text{if } \pi \in 1^* \\ \text{undefined} & \text{otherwise} \end{cases}$$

Two position indices $\pi, \pi'$ correspond if $\pi' = \pi\beta$, $\beta \in 0^*$. In this case, we also say that $\pi$ points to $\pi'$.

The notion of correspondence gives the parser some freedom in the pointer indicating the
index to be expanded. Indeed, the parser will not try to expand the node whose index is exactly equal
to the pointer, but one corresponding to it, that is, equal to the pointer followed by as many 0s as
possible. In the derived tree, this is the node down the leftmost path from the node indicated by the
pointer, which is what we want: the first unexpanded node below the pointer.

Definition 3.2. A situated category is a pair $\alpha^n/[F^n]$, where $\alpha^n$ is a sequence of $n$ position indices
and $F^n$ is a sequence of $n$ dotted feature strings (so that $[F^n]$ is a category). For readability, we will
write $(\alpha_1, \ldots, \alpha_n)/[F_1, \ldots, F_n]$ as $[\alpha_1/F_1, \ldots, \alpha_n/F_n]$.

Definition 3.3. A hypothesis is a 5-tuple $(T, \pi, p, s, A)$ where:

• $T$ is a finite set of situated categories (the nodes of the partial derivation tree),

• $\pi$ is a position index, the pointer, pointing to the next node to expand,

• $p \in [0, 1]$ is the probability of the hypothesis,

• $s$ is a dotted input string, and

• $A$ is the sequence of rules used to obtain this hypothesis from the axiom start.

The dotted input string s is the string of words of the phrase being parsed, with a dot indicating up
to which point it has already been parsed (in fact, up to which point the words have been scanned).
For example, if s = "The cat has • eaten the mouse", this means that this hypothesis has already
scanned (i.e., recognised a node for) the words "The", "cat" and "has", but not yet "eaten", "the"
and "mouse".

3.2 Position indices and nodes

The parser will expand the hypothesis trees in a quite particular way, corresponding to a left-to-right
reading of the output sentence. Since movement is possible in MG, the parser has to keep track
of the 'position' of the different elements, so as to expand only the leaves corresponding to the
currently parsed word. This is the role of the position indices.
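The successor and correspondence relations on position indices (Definition 3.1) are easy to state as code. The following is a minimal sketch, assuming a position index is encoded as a string over `0` and `1` (the empty string being the root) and the special index −1 as the string `"-1"`; the function names are hypothetical.

```python
def successor(pi: str) -> str:
    """Successor of a position index.

    alpha + '0' + beta with beta in 1*  ->  alpha + '1'
    pi in 1* (including the empty index) ->  '-1'
    '-1' itself has no successor.
    """
    if pi == "-1":
        raise ValueError("the successor of -1 is undefined")
    stripped = pi.rstrip("1")      # drop the trailing run of 1s (beta)
    if not stripped:               # pi consisted only of 1s
        return "-1"
    return stripped[:-1] + "1"     # stripped ends in '0': replace it by '1'


def corresponds(pi: str, other: str) -> bool:
    """True if other == pi followed only by 0s, i.e. pi points to other."""
    return other.startswith(pi) and set(other[len(pi):]) <= {"0"}
```

Iterating `successor` from `"0"` visits `"1"` and then stops at `"-1"`, which is one way to read the left-to-right sweep the pointer performs.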
A position index different from −1 represents a particular subtree of the final derived tree,
in which the traces of the moved nodes are deleted (the sister of a trace moving up to the position of
its mother). The position index $\alpha_0 \cdots \alpha_k$ corresponds to the subtree dominated by the node reached
by going down the tree from its root: left if $\alpha_0 = 0$, right otherwise; then again left if $\alpha_1 = 0$,
right otherwise; and so on.

Back to our toy grammar (7):

1. a :: =x +f c

2. b :: =y x

3. c :: y -f

The derivation tree with indexed relevant categories corresponding to (8) is:

    [ε/ =x +f • c]
        [1/ =x • +f c, 0/ y • -f]
            [10/ • =x +f c]
            [11/ =y • x, 0/ y • -f]
                [11/ • =y x]
                [0/ • y -f]

The indexed relevant category at the root of the derivation tree has an empty position string, since
it represents the derived tree itself. In [11/ =y • x, 0/ y • -f], for example, we have 11/ =y • x because
this relevant category represents the tree under the node reached by going right (1), then right again
(11), from the root node of the derived tree (without the moving categories, since they will move and so
will not end up in the same place). We similarly have 0/ y • -f because the subtree described by y • -f
ends up as the left (0) daughter of the root of the derived tree.

The assignment of these position strings is given by the inference rules, which will be discussed
later.

3.3 Axiom 

The axiom of the parser is exactly the same as the axiom of the LCFRS corresponding to our MG
discussed in Section 2, plus an empty position string (it represents the whole derived tree). Its
probability is of course 1, and the pointer is set to ε. So, if the phrase to be parsed is ω, the
parser axiom is (ε/start, ε, 1, •ω, ()).
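A hypothesis and the parser axiom can be represented quite directly. The sketch below is only one possible encoding, with several assumptions: the dotted input string is stored as the tuple of words plus the index of the dot, position indices are strings, and hypotheses are ordered by decreasing probability so that Python's `heapq` (a min-heap) pops the most probable one first. The class and function names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Hypothesis:
    # Only sort_key takes part in comparisons; it is -prob so that the
    # most probable hypothesis is the smallest element of the heap.
    sort_key: float = field(init=False, repr=False)
    nodes: frozenset = field(compare=False)   # T: situated categories
    pointer: str = field(compare=False)       # pi: next position to expand
    prob: float = field(compare=False)        # p
    words: tuple = field(compare=False)       # input words
    dot: int = field(compare=False)           # number of words already scanned
    rules: tuple = field(compare=False)       # rules used so far

    def __post_init__(self):
        self.sort_key = -self.prob


def axiom(words):
    """Parser axiom: start at the empty position, probability 1, nothing scanned."""
    return Hypothesis(frozenset({("", "start")}), "", 1.0, tuple(words), 0, ())


queue = []
heapq.heappush(queue, Hypothesis(frozenset(), "0", 0.25, ("which", "mouse"), 1, ("Start",)))
heapq.heappush(queue, axiom(["which", "mouse"]))
best = heapq.heappop(queue)
```

Keeping the probability out of the compared fields and using a separate sort key avoids comparing sets or tuples when two hypotheses have the same probability.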

3.4 Inference rules 

We here have exactly the same inference rules as before, exept that these will assign position strings 
too. So here they are: 

1. Start rules: for every lexical item j :: 6 c, 
Start: e/ start — > [e/5 ■ c] 
2. Un-merge rules: the left-hand category is of the form [α/ δ =x · θ, S]

(a) Cases where the selector was a simple tree (δ = ε):

i. For any lexical item of feature string γ x,
Unmerge-1:

[α/ =x · θ, S]  →           t
                          /   \
            [α0/ · =x θ]       [α1/ γ · x, S]

t is here s if γ = ε (and thus S = ∅ too), and c otherwise.

ii. For any element (γ x · φ) ∈ S, with S' = S - {γ x · φ},
Unmerge-3, simple:

[α/ =x · θ, β/ γ x · φ, S']  →           t
                                       /   \
                          [α/ · =x θ]       [β/ γ · x φ, S']

t is here s if γ = ε (and thus S' = ∅ too), and c otherwise. It should be noted that
necessarily φ ≠ ε.

(b) Cases where the selector was a complex tree:

i. For any decomposition S = U ∪ V, and any lexical item of feature string γ x,
Unmerge-2:

[α/ δ =x · θ, S]  →              t
                               /   \
            [α1/ δ · =x θ, U]       [α0/ γ · x, V]

t is, as always, s if γ = ε (and thus V has to be empty too), and c otherwise.

ii. For any element (γ x · φ) ∈ S, and any decomposition S = U ∪ V ∪ {γ x · φ},
Unmerge-3, complex:

[α/ δ =x · θ, β/ γ x · φ, S']  →              t
                                            /   \
                          [α/ δ · =x θ, U]       [β/ γ · x φ, V]

t is still s if γ = ε (and thus V has to be empty too), and c otherwise. As in 2(a)ii, φ
has to be non-empty.

3. Un-move rules: the left-hand category is of the form [α/ δ +f · θ, S]

(a) For any (γ -f · φ) ∈ S (necessarily unique by the Shortest Movement Constraint), with
S' = S - {γ -f · φ},
Unmove-2:

[α/ δ +f · θ, β/ γ -f · φ, S']  →  ∘
                                    |
                  [α/ δ · +f θ, β/ γ · -f φ, S']

(b) If there is no (γ -f · φ) ∈ S, then for any lexical item of feature string γ -f,
Unmove-1:

[α/ δ +f · θ, S]  →  ∘
                     |
     [α1/ δ · +f θ, α0/ γ · -f, S]

There is no lexicalise rule, since it is replaced by a 'scan rule', which checks whether the feature
string of the word currently parsed corresponds to the current feature string.
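The way the inference rules above assign position strings can be summarised in a few lines of code. This is a minimal sketch: the function `daughter_positions` and its string encoding of rule names are assumptions made for illustration. It only tracks the position string α of the main element and the position string β of the involved mover, not the feature strings.

```python
from typing import Optional, Tuple


def daughter_positions(rule: str, alpha: str, beta: Optional[str] = None) -> Tuple[str, str]:
    """Return the two position strings appearing on the right-hand side of a rule.

    alpha is the position string of the main element of the left-hand category;
    beta is the position string of the mover involved (Unmerge-3 and Unmove-2 only).
    The order follows the order in which the rules list their right-hand elements.
    """
    if rule == "Unmerge-1":
        return alpha + "0", alpha + "1"   # selector left, selected item right
    if rule in ("Unmerge-2", "Unmove-1"):
        return alpha + "1", alpha + "0"   # main element right, other element left
    if rule in ("Unmerge-3", "Unmove-2"):
        if beta is None:
            raise ValueError(f"{rule} needs the position string of the mover")
        return alpha, beta                # nothing changes
    raise ValueError(f"unknown rule {rule!r}")
```

Replaying the toy derivation above: Unmove-1 at the root gives `("1", "0")`, Unmerge-1 at `1` gives `("10", "11")`, and Unmerge-3 on `[11/ ..., 0/ ...]` leaves `("11", "0")` unchanged.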
3.5 Top-down parser

The parser takes an input string ω = ω_0 … ω_{n-1}, a minimalist grammar G rewritten into an LCFRS S,
and a beam function f, which sets a threshold on the probability of the selected hypotheses. It works
on a priority queue of hypotheses ℋ. The function f can be very general; here we consider that
its argument is the priority queue ℋ. A hypothesis is a tuple (T, π, p, ω_0 … · ω_k …, Λ): a partial
tree T, a pointer π, a probability p, the dotted input string and the sequence Λ of rules used so far.
The parser works as follows:

1. Beginning: The parser starts with the queue of hypotheses consisting of the axiom
(ε/start, ε, 1, ·ω, ⟨⟩) of the grammar.

2. Expanding: At each step, the parser will:

• take the top-ranked hypothesis (T, π, p, ω_0 … · ω_k …, Λ) (i.e. the hypothesis with greatest p) in
the priority queue,

• check the corresponding position string pointer. If π = -1 and the parsing dot in ω_0 … · ω_k …
is at the far right (i.e. k = n), then the parser terminates and returns the sequence of rules
Λ. If the phrase is not completely parsed (π = -1 but k < n), the hypothesis is deleted
and the parser moves to the next one. If π ≠ -1, the parser moves to the next step, and
tries to:

• find the leaf C of T which contains the position string α corresponding to the pointer π. If
α ≠ π, π is set to α.

• expand C. For this, we have two possibilities:

(a) Expand: If C is a complex situated category, the parser deletes the current hypothesis
(T, π, p, ω_0 … · ω_k …, Λ) and adds to the priority queue, for every possible inference rule
C → t, a new hypothesis (T', π', p·P(C → t), ω_0 … · ω_k …, Λ@(C → t)), such that T' is
T where the node C has been replaced by t, and π' is either π0 if the rule changed
the value of the position string corresponding to π (i.e. if the rule was Unmerge-1,
Unmerge-2 or Unmove-1, and the first element of C had α as position string), or π
in the other cases. @ is the concatenation operator.

(b) Scan: If C is a simple indexed category, say C = [α/ · δ], then the parser deletes the
current hypothesis (T, π, p, ω_0 … · ω_k …, Λ) and tries to lexicalize C. It does two things:

i. Scan, ε: If there is a rule [· δ] → ε :: δ, then a new hypothesis
(T', s(π), p·P([· δ] → ε :: δ), ω_0 … · ω_k …, Λ@([· δ] → ε :: δ)) is added to the priority queue,
where T' is T where the leaf C has been replaced by ε :: δ.

ii. Scan, ω: If ω_k :: δ is in the grammar, then a new hypothesis
(T', s(π), p·P([· δ] → ω_k :: δ), ω_0 … ω_k · ω_{k+1} …, Λ@([· δ] → ω_k :: δ)) is added to the
priority queue, where T' is T where the leaf C has been replaced by ω_k :: δ.

If both of these steps fail, then no hypothesis is added to the priority queue.

The new hypotheses are inserted in the priority queue at their 'right place', i.e. after all
hypotheses of higher probability.

• Prune: The parser deletes all hypotheses of the priority queue whose probability is lower
than f(ℋ).

If ℋ is empty, then the parse has failed and the sentence is judged ungrammatical.
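The control loop described above (take the best hypothesis, expand or scan it, reinsert the results, prune) can be sketched independently of the grammar. The sketch below is a simplification under stated assumptions: the names `best_first`, `expand`, `is_final` and `beam` are hypothetical, a hypothesis is reduced to a probability and an opaque state, and the toy "grammar" S → a S b (.4) | a b (.6) is an ordinary context-free stand-in rather than the LCFRS of this paper.

```python
import heapq
from itertools import count
from typing import Callable, Iterable, List, Optional, Tuple


def best_first(axiom,
               expand: Callable[[object], Iterable[Tuple[float, object]]],
               is_final: Callable[[object], bool],
               beam: Callable[[List[float]], float]) -> Optional[Tuple[float, object]]:
    """Best-first search over hypotheses with beam pruning."""
    tie = count()                        # tie-breaker so that states are never compared
    queue = [(-1.0, next(tie), axiom)]   # heapq is a min-heap: store negated probabilities
    while queue:
        neg_p, _, state = heapq.heappop(queue)   # top-ranked hypothesis
        p = -neg_p
        if is_final(state):
            return p, state
        for rule_p, new_state in expand(state):  # a dead end simply adds nothing
            heapq.heappush(queue, (-(p * rule_p), next(tie), new_state))
        threshold = beam([-q for q, _, _ in queue])
        queue = [h for h in queue if -h[0] >= threshold]   # prune
        heapq.heapify(queue)
    return None                          # empty queue: the parse failed


TARGET = "aabb"
RULES = (("aSb", 0.4), ("ab", 0.6))


def toy_expand(form: str):
    """Rewrite the single S, keeping only forms compatible with TARGET."""
    if "S" not in form:
        return []
    out = []
    for rhs, prob in RULES:
        new = form.replace("S", rhs, 1)
        prefix = new.split("S")[0]
        if TARGET.startswith(prefix) and len(new.replace("S", "")) <= len(TARGET):
            out.append((prob, new))
    return out


result = best_first("S", toy_expand, lambda form: form == TARGET,
                    lambda probs: 0.01 * max(probs, default=0.0))
```

Here the beam function keeps every hypothesis whose probability is at least one hundredth of the best one in the queue; any other function of the queue could be substituted.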



[Figure: schematic of the parser; the only surviving label is "Input String".]
3.6 Example

Here we present a small example of the parsing of a particular sentence of the grammar
presented in (4), which consists of the lexical items:

1. ε :: c

2. ε :: =a +m c

3. a :: =b a -m

4. b :: b

5. b :: =a +m b

The corresponding LCFRS consists of these rules:



Rule   Left-hand side              Right-hand side                  Type                P(.)
S1     start                       [· c]                            Start               .7
S2     start                       [=a +m · c]                      Start               .3
L1     [· c]                       ε :: c                           Lexicalize          1
Mv1    [=a +m · c]                 ∘ [=a · +m c, =b a · -m]         Unmove-1            1
Mg1    [=a · +m c, =b a · -m]      [· =a +m c] [=b · a -m]          Unmerge-3, simple   1
L2     [· =a +m c]                 ε :: =a +m c                     Lexicalize          1
Mg2    [=b · a -m]                 [· =b a -m] [· b]                Unmerge-1, simple   .4
Mg3    [=b · a -m]                 [· =b a -m] [=a +m · b]          Unmerge-1, simple   .6
L3     [· =b a -m]                 a :: =b a -m                     Lexicalize          1
L4     [· b]                       b :: b                           Lexicalize          1
Mv2    [=a +m · b]                 ∘ [=a · +m b, =b a · -m]         Unmove-1            1
Mg4    [=a · +m b, =b a · -m]      [· =a +m b] [=b · a -m]          Unmerge-3, simple   1
L5     [· =a +m b]                 b :: =a +m b                     Lexicalize          1

Here we took λ = .7 and μ = .4.

We will now try to parse the string aabb, which is generated by the grammar (the ε is of course
omitted).

The parser begins with its queue consisting of the axiom of the grammar:

ℋ = ⟨(ε/start, ε, 1, ·aabb, ⟨⟩)⟩

The parser takes the top-ranked hypothesis, (ε/start, ε, 1, ·aabb, ⟨⟩), its pointer (null), the
corresponding leaf, start, and tries to expand it. There are two possibilities, which are added to
the hypothesis queue, ordered by decreasing probabilities:

ℋ = ⟨([ε/ · c], ε, .7, ·aabb, ⟨S1⟩), ([ε/ =a +m · c], ε, .3, ·aabb, ⟨S2⟩)⟩

The parser now takes the top-ranked hypothesis, ([ε/ · c], ε, .7, ·aabb, ⟨S1⟩), its pointer (null), the
corresponding leaf, [ε/ · c], and tries to scan it since it is a simple category. There is only one rule
whose left side is [· c], L1, with probability 1. The corresponding word is empty, so the scan
succeeds, the pointer is increased to -1 and the new hypothesis is added to the queue in place
of the old one:

ℋ = ⟨([ε/ ε :: c], -1, .7, ·aabb, ⟨S1, L1⟩), ([ε/ =a +m · c], ε, .3, ·aabb, ⟨S2⟩)⟩

The parser takes once more the top-ranked hypothesis, ([ε/ ε :: c], -1, .7, ·aabb, ⟨S1, L1⟩). Its pointer
is -1, so the parser checks whether the parse is indeed over. It is not, since the string right of the
dot is non-empty. The current hypothesis is now deleted, and the new queue is fed to the parser:

ℋ = ⟨([ε/ =a +m · c], ε, .3, ·aabb, ⟨S2⟩)⟩

The top-ranked hypothesis is now the only one in the queue, ([ε/ =a +m · c], ε, .3, ·aabb, ⟨S2⟩). Its
pointer is null, the corresponding leaf is [ε/ =a +m · c], which the parser will try to expand. There is
only one rule, Mv1: [=a +m · c] → ∘ [=a · +m c, =b a · -m], so the hypothesis is replaced by the
new one:

ℋ = ⟨(T_1, 0, .3, ·aabb, ⟨S2, Mv1⟩)⟩

where T_1 is the tree

              ∘
              |
[1/ =a · +m c, 0/ =b a · -m]

(note that the pointer was modified, since the position string of the expanded element was
modified by the rule).

The parser goes on, giving the new queue:

ℋ = ⟨(T_2, 0, .3, ·aabb, ⟨S2, Mv1, Mg1⟩)⟩

where T_2 is the tree whose root ∘ has a single daughter, a merge node with daughters
[1/ · =a +m c] and [0/ =b · a -m].

And so forth.
4 Proofs of soundness and completeness

4.1 Soundness of the pointer

We will here demonstrate that the parser, at each step, indeed finds the correct leaf to expand
with the current pointer.

First, for the convenience of the proof, we slightly modify our definition of a position string:

Definition 4.1. A position index (and a pointer) is a dotted, almost null binary sequence
$\alpha \cdot \beta$, with $\alpha\beta \in \{0,1\}^k\,\mathbf{0}$ for some $k$, where $\mathbf{0}$ denotes the
infinite sequence of zeros.

A position index and a pointer correspond if their undotted sequences are the same.
The set of all undotted position indexes is naturally ordered by the lexicographic order.

Note: This definition is consistent with the one I proposed in the preceding section, the dot being
the place where the 'new' position indexes are to be truncated to obtain the 'old' ones.

Proposition 4.1. There is a one-to-one mapping from the set $\mathcal{I}$ of all position indexes to the
set $\mathcal{ST}$ of all subtrees of the infinite complete binary tree $\mathcal{T}$. The heads of the subtrees
corresponding to the position indexes of the type $\alpha_1 \dots \alpha_n \cdot \mathbf{0}$ are exactly the nodes
of depth $n$ of the tree. We thus have a notion of domination on the position indexes, corresponding to
the notion of domination in the tree (by convention, a node dominates itself). A position index
$\alpha \cdot \mathbf{0}$ dominates another position index $\alpha' \cdot \mathbf{0}$ if $\alpha$ is a prefix of $\alpha'$.

Proof. Let

$$\phi : \mathcal{I} \to \mathcal{ST}, \qquad \alpha_1 \dots \alpha_n \cdot \mathbf{0} \mapsto T,$$

where $T$ is the subtree headed by the node obtained by, starting from the root of $\mathcal{T}$, going left
if $\alpha_i = 0$ and right otherwise, for each $\alpha_i$ in turn. This mapping clearly has all the above
properties. □

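Lemma 4.1. For every cut $C$ of position indexes,

i. If $\beta \cdot \mathbf{0}$ is in the cut $C$, then there is no $\beta\gamma \cdot \mathbf{0}$ in $C$, for every $\gamma \in \{0,1\}^+$.
ii. If $\beta 0 \gamma \cdot \mathbf{0}$ is in the cut $C$, then there is a unique $k \in \mathbb{N}$ such that $\beta 1 0^k \cdot \mathbf{0}$ is in the cut.
iii. The lexicographic order can be extended to the dotted elements of $C$.

Proof. Let $n$ be the depth of the cut.

i. Suppose that we have $\beta \cdot \mathbf{0}$ and $\beta\gamma \cdot \mathbf{0}$ in the cut, for some $\gamma \in \{0,1\}^+$. Then
$\beta\gamma 0^n \cdot \mathbf{0}$ is dominated by both $\beta \cdot \mathbf{0}$ and $\beta\gamma \cdot \mathbf{0}$, which is a contradiction.

ii. Since $C$ is a cut, $\beta 1 0^n \cdot \mathbf{0}$ must be dominated by a unique element of $C$, say $\delta \cdot \mathbf{0}$. Then
$\delta$ is a prefix of $\beta 1 0^n$. But $\delta$ cannot be a prefix of $\beta 0 \gamma$, since otherwise $\delta \cdot \mathbf{0}$ would
dominate $\beta 0 \gamma \cdot \mathbf{0}$, which is already dominated by itself. So $\delta$ must be of the form $\beta 1 0^k$,
$k \le n$. The uniqueness follows from i.

iii. This follows directly from i.

□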

We can now prove that the parser will never have a pointer problem:

Theorem 4.1. At each point of the parse, the position indexes of (all) the hypothesis form a cut,
the already scanned position indexes form a prefix set of all the position indexes (for the lexicographic
order), and the pointer corresponds to the smallest unscanned position index (which exists since the set
is finite), or is possibly -1 if there is none.

Proof. We prove this result by induction on the number of steps done by the parser.

• At the beginning, the set of the position indexes is reduced to $\{\cdot \mathbf{0}\}$, and the pointer is
$\cdot \mathbf{0}$, so the result is trivially true.

• Suppose that at step $n$, the position indexes of (all) the hypothesis form a cut, the already scanned
position indexes form a prefix set of all the position indexes (for the lexicographic order), and the
pointer corresponds to the smallest unscanned position index (which exists since the set is finite),
or is possibly -1 if there is none. Let $\alpha \cdot \mathbf{0}$ be the position index corresponding to the pointer
(unique by hypothesis), and $[\beta \cdot \mathbf{0}/\lambda, \dots, \alpha \cdot \mathbf{0}/\lambda', \dots]$ the leaf to be expanded. By
hypothesis, $\alpha \cdot \mathbf{0} \le \beta \cdot \mathbf{0}$. Let
$(\delta_1, \dots, \delta_k \,//\, \alpha \cdot \mathbf{0}, \dots, \beta \cdot \mathbf{0}, \dots)$ be the position indexes,
lexicographically ordered, the scanned ones left of the $//$. By hypothesis, this is a cut. The state of
the parser at step $n + 1$ can be obtained by four different cases:

1. if the position index corresponding to the pointer, $\alpha \cdot \mathbf{0}$ ($< \beta \cdot \mathbf{0}$), is not modified
by the rules, two cases:

(a) if the main position index of the leaf being expanded, $\beta \cdot \mathbf{0}$, is not modified (i.e. during
Un-merge-3 or Un-move-2), neither the position indexes nor the pointer are modified, so the
result still holds.

(b) if the main position index of the leaf being expanded, $\beta \cdot \mathbf{0}$, is modified (i.e. during
Un-merge-1, -2 or Un-move-1), the new position indexes are (lexicographically ordered,
the already scanned ones left of the $//$)

$$(\delta_1, \dots, \delta_k \,//\, \alpha \cdot \mathbf{0}, \dots, \beta 0 \cdot \mathbf{0}, \beta 1 \cdot \mathbf{0}, \dots).$$

This is trivially still a cut, and the pointer is still $\alpha \cdot \mathbf{0}$, corresponding to the smallest
unscanned position index $\alpha \cdot \mathbf{0}$.

2. if the position index corresponding to the pointer, $\alpha \cdot \mathbf{0}$ ($= \beta \cdot \mathbf{0}$), is modified by the
rules, the new position indexes are

$$(\delta_1, \dots, \delta_k \,//\, \alpha 0 \cdot \mathbf{0}, \alpha 1 \cdot \mathbf{0}, \dots).$$

This is still trivially a cut, and the new pointer is $\alpha 0 \cdot \mathbf{0}$, corresponding to the smallest
unscanned position index $\alpha 0 \cdot \mathbf{0}$.

3. if the parser scans the leaf (and then $\alpha \cdot \mathbf{0} = \beta \cdot \mathbf{0}$):

If $\alpha \in 1^*$, the pointer is set to -1. There cannot be a greater item in the cut than $\alpha \in 1^*$
(since it would then have to have $\alpha$ as a prefix, which is impossible by Lemma 4.1), so the
result is true.

Otherwise, let us write $\alpha = \gamma 0 1^m$. The pointer is set to $\gamma 1 \cdot \mathbf{0}$. The new position
indexes are

$$(\delta_1, \dots, \delta_k, \gamma 0 1^m \cdot \mathbf{0} \,//\, \zeta \cdot \mathbf{0}, \dots).$$

Since the only thing that changed here is the position of the $//$, this is still a cut. By
Lemma 4.1, there exists a unique $\xi = \gamma 1 0^k$ in the cut, since $\alpha = \gamma 0 1^m$ was in it. This
being the smallest position index greater than $\alpha = \gamma 0 1^m$, we have $\zeta = \xi = \gamma 1 0^k$, which
corresponds to the new pointer. This completes the proof.

□
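The pointer update used in case 3 of the proof can be written down directly. This is a sketch under the assumption that pointers are encoded as Python strings over '0' and '1' and that `None` stands for the pointer -1; the function name `successor` is hypothetical.

```python
from typing import Optional


def successor(pointer: str) -> Optional[str]:
    """Pointer update after a scan, as in case 3 of the proof of Theorem 4.1.

    A pointer of the form gamma 0 1^m becomes gamma 1; a pointer in 1*
    (including the empty pointer) becomes -1, encoded here as None.
    """
    stripped = pointer.rstrip("1")       # remove the trailing block 1^m
    if not stripped:                     # pointer was in 1*: nothing left to scan
        return None
    return stripped[:-1] + "1"           # stripped ends with 0: turn gamma 0 into gamma 1
```

For instance, `successor("1011")` is `"11"`. The returned pointer γ1 need not be a position index of the cut itself; by Lemma 4.1 it corresponds to the unique γ10^k that is, which is why the parser resets π to the matching α when it looks up the leaf.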
4.2 Completeness

Here we demonstrate the completeness of the parser without the pruning step, i.e. that if the string
to be parsed can indeed be generated by the grammar, then the parser will eventually parse it.

Lemma 4.2. Let $(a_n)_{n \in \mathbb{N}} \in (\Sigma^*)^{\mathbb{N}}$ be an infinite sequence of finite distinct strings over a
finite alphabet $\Sigma$. Suppose that the following holds:

$$\forall n \in \mathbb{N}, \text{ all prefixes of } a_n \text{ are in } \{a_k, k \le n\}. \qquad (1)$$

Then there exists an infinite sequence $(x_k)_{k \in \mathbb{N}} \in \Sigma^{\mathbb{N}}$ such that $(x_0 \dots x_k)_{k \in \mathbb{N}}$ is a
subsequence of $(a_n)_{n \in \mathbb{N}}$.

Proof. Let us build this sequence recursively:

• Among all elements of $\Sigma$, there is an element which is a prefix of infinitely many elements of
$(a_n)_{n \in \mathbb{N}}$. Let us call it $x_0$. By hypothesis, $x_0 \in (a_n)_{n \in \mathbb{N}}$.

• Suppose we have $x_0, \dots, x_N$ such that $(x_0 \dots x_k)_{k \in [0,N]}$ is a finite subsequence of
$(a_n)_{n \in \mathbb{N}}$. Without loss of generality, we can restrict $(a_n)_{n \in \mathbb{N}}$ to its subsequence composed of
the elements $x_0 \dots x_k$, $k \in [0, N]$, and all elements of $(a_n)_{n \in \mathbb{N}}$ with prefix $x_0 \dots x_N$. This
new sequence is still infinite, and has the prefix property (1).

Among all elements of $\Sigma$, there is an element $x$ such that $x_0 \dots x_N x$ is a prefix of infinitely many
elements of $(a_n)_{n \in \mathbb{N}}$. Let $x_{N+1} = x$. By hypothesis, $x_0 \dots x_{N+1} \in (a_n)_{n \in \mathbb{N}}$.

• $(x_k)_{k \in \mathbb{N}}$ has the property we seek.

□

Theorem 4.2. For all $p \in (0,1]$, if there is no looping chain of rules of probability 1 in the grammar,
there are finitely many partial derivation trees (PDT) of probability $> p$.

Proof. A partial derivation tree is exactly defined by the string of rules deriving it. So a PDT will
here be seen, when convenient, as a string of rules.

Suppose there were infinitely many PDTs of probability $> p$. Then we have a sequence of strings
of rules as defined by Lemma 4.2. The lemma then gives us a sequence $A_0, \dots, A_n, \dots$ of rules such
that, for all $n$, $A_0 \dots A_n$ defines a correct partial derivation tree (since $A_0 \dots A_n$ is in the initial
sequence, composed of correct PDTs).

To this sequence of rules corresponds an infinite sequence of growing PDTs, all of which have
probability $> p$. Since the sequence is growing and infinite, there is an infinite path in the limit of the
trees, given by the sequence of rules $(A_{\varphi(n)})_{n \in \mathbb{N}}$.

Or, differently put, there is a sequence of rules $(A_{\varphi(n)})_{n \in \mathbb{N}}$ such that for all $n \ge 1$, the left
side of $A_{\varphi(n)}$ is in the right side of $A_{\varphi(n-1)}$.

Then there is a finite sequence $A_{\varphi(k)}, \dots, A_{\varphi(k')}$ such that

$$(A_{\varphi(n)})_{n \in \mathbb{N}} = A_{\varphi(0)}, A_{\varphi(1)}, \dots, A_{\varphi(k-1)}, (A_{\varphi(k)}, \dots, A_{\varphi(k')})^{*}. \qquad (2)$$

Indeed, let $\mathcal{A}$ be the finite state automaton with:

• states $A_{\varphi(n)}$, $n \in \mathbb{N}$, and END,

• starting state $A_{\varphi(0)}$; ending state END,

• transition rules $A_{\varphi(n)} \to A_{\varphi(n+1)}$ and $A_{\varphi(n)} \to \mathrm{END}$ for all $n$.

This automaton generates exactly all prefixes of $(A_{\varphi(n)})_{n \in \mathbb{N}}$. By the pumping lemma, there is a
string $A_{\varphi(0)} A_{\varphi(1)} \dots A_{\varphi(N)}$ and two integers $k, k'$ such that
$A_{\varphi(0)} \dots A_{\varphi(k-1)} (A_{\varphi(k)} \dots A_{\varphi(k')})^{*} A_{\varphi(k'+1)} \dots A_{\varphi(N)}$ is in the language
recognised by $\mathcal{A}$, that is, the prefixes of $(A_{\varphi(n)})_{n \in \mathbb{N}}$. This is exactly (2).

Let $p' = \prod_{i=k}^{k'} \mathbb{P}(A_{\varphi(i)})$. Then for all $n \in \mathbb{N}$, $p'^n$ is an upper bound of the probability
of some PDT of probability $> p$ (take any PDT where $(A_{\varphi(k)}, \dots, A_{\varphi(k')})^n$ is in the sequence of
its rules). Thus $p'^n > p > 0$ for all $n$, which is impossible unless $p' = 1$. This contradicts the
hypothesis that no looping chain of rules in the grammar has probability 1. □

Corollary 4.1. If there is no looping chain of rules of probability 1 in the grammar, the parser is
complete.

Proof. If there is no looping chain of rules of probability 1, then Theorem 4.2 holds.

Let $A_0, \dots, A_n$ be the sequence of rules giving the parse of the parsed string. We will show that for
all $k \in [0, n]$, the parser will have, after finitely many steps, $A_0 \dots A_k$ as its top-ranked hypothesis.

Let $p_k = \prod_{i=0}^{k} \mathbb{P}(A_i)$. Let us prove the result by induction on $k$:

• for $k = 0$, $A_0$ is an axiom, so it is in the priority queue from the beginning of the parse. By
Theorem 4.2, there are finitely many PDTs of probability greater than $p_0$, say $N$. Since the parser will
have each of them at most once as its top-ranked hypothesis, $A_0$ will be the top-ranked hypothesis
before step $N + 1$.

• suppose that after $M$ steps, $A_0 \dots A_k$ is the top-ranked hypothesis of the parser. Then at the
$(M + 1)$-th step, the parser will expand $A_0 \dots A_k$, and put (among others) $A_0 \dots A_k A_{k+1}$ in the
priority queue. Since, by Theorem 4.2, there are finitely many PDTs of probability greater than
$p_{k+1}$, say $N$, and the parser will have each of them at most once as its top-ranked hypothesis,
$A_0 \dots A_k A_{k+1}$ will be the top-ranked hypothesis before step $M + 1 + N + 1$.

This completes the induction. The result follows from the case $k = n$. □

5 Conditioning the rules with the CTW algorithm 

In this section we present a way to improve the performance of the parser, using the Context Tree
Weighting (CTW) algorithm, whose description and properties may be found in [WST95] and an
implementation in [FP]. The algorithm was originally intended for data compression, but its
construction of context trees allows us to use it in our parser.

The CTW algorithm uses a double mixture of context trees, and its strength lies in the fact that
conditional probabilities may be computed recursively, which greatly decreases computation
time. The idea is to mix together the Krichevsky-Trofimov estimators for all possible context trees of
depth less than a certain M, allowing for a near-optimal coding (and as such, a near-optimal estimate
of conditional probabilities, in the sense of the Kullback-Leibler divergence).
5.1 Quick overview of the CTW

The setting is the following: consider a stationary ergodic source of unknown law $\mathbb{P}$. We want to
estimate $\mathbb{P}$ by $\hat{\mathbb{P}}$, which will be a mixture of context tree laws.

5.1.1 Sources with a context tree

Definition 5.1. A complete prefix dictionary $\mathcal{D}$ on $X$ (of cardinal $K$) is a cut of $X^*$ (seen as a
tree), that is, a finite subpart of $X^*$ such that for all $x_{-\infty:-1}$ there is a unique $m$ such that
$x_{-m:-1} \in \mathcal{D}$. Let us call $f$ its context function, defined by $f(x_{-\infty:-1}) = x_{-m:-1}$. Its depth,
noted $l(\mathcal{D})$, is the maximum length of its elements (or, more simply, the depth of the cut).

Suppose that we have reasons to think that $\mathbb{P}$ has a context tree $\mathcal{D}$ (or at least, can be successfully
approximated by such a law), that is, $\mathbb{P}$ is stationary and for all $x_{-\infty:n}$,

$$\mathbb{P}(X_n = x_n \mid X_{-\infty:n-1} = x_{-\infty:n-1}) = \mathbb{P}(X_n = x_n \mid f(X_{-\infty:n-1}) = f(x_{-\infty:n-1})).$$

We then want to estimate the conditional probabilities for all contexts in $\mathcal{D}$. Suppose that, for a
context $s$, $\theta^s$ is the law on $X$ conditionally on the context $s$. Then, for a source with context tree
$\mathcal{D}$, of parameter $(\theta^s)_{s \in \mathcal{D}}$,

$$\mathbb{P}_{\mathcal{D},\theta}(X_{1:n} = x_{1:n} \mid X_{-\infty:0} = x_{-\infty:0})
= \prod_{i=1}^{n} \mathbb{P}_{\mathcal{D},\theta}(X_i = x_i \mid f(X_{-\infty:i-1}) = f(x_{-\infty:i-1}))
= \prod_{s \in \mathcal{D}} \mathbb{P}_{\theta^s}(S_s(x_{1:n}; x_{-\infty:0}))$$

where $\mathbb{P}_{\theta^s}$ is the law of a string of i.i.d. variables of law $\theta^s$, and $S_s(x_{1:n}; x_{-\infty:0})$ is the
substring of symbols in $x_{1:n}$ with context $s$.

The idea is to mix all those probabilities for all $\theta^s$: for a prior distribution
$\nu_{\mathcal{D}}(d\theta) = \bigotimes_{s \in \mathcal{D}} \nu(d\theta^s)$, where $\nu$ is a measure on the set of possible $\theta^s$, the
simplex $\Theta = \{(\theta_1, \dots, \theta_K) \in [0,1]^K : \sum_i \theta_i = 1\}$, we get:

$$KT_{\mathcal{D},\nu}(x_{1:n} \mid x_{-\infty:0}) = \int \mathbb{P}_{\mathcal{D},\theta}(x_{1:n} \mid x_{-\infty:0})\, \nu_{\mathcal{D}}(d\theta)
= \prod_{s \in \mathcal{D}} \int_{\Theta} \mathbb{P}_{\theta^s}(S_s(x_{1:n}; x_{-\infty:0}))\, \nu(d\theta^s)$$

A good choice for $\nu$ is a Dirichlet $\mathbb{D}(1/2, \dots, 1/2)$ distribution: the Dirichlet law with parameter
$a = (a_1, \dots, a_K)$ is the law on $\Theta$ with density

$$f(\theta_1, \dots, \theta_K) = \frac{\Gamma(a_1 + \dots + a_K)}{\Gamma(a_1) \cdots \Gamma(a_K)} \prod_{i=1}^{K} \theta_i^{a_i - 1}$$

with respect to the Lebesgue measure.

This distribution has nice properties, for example an oracle inequality for the code-length. This
choice gives the Krichevsky-Trofimov estimator $KT_{\mathcal{D}} = KT_{\mathcal{D},\mathbb{D}(1/2,\dots,1/2)}$ for sources with a
context tree.
Lemma 5.1. Let

- $c_s^y(x_{1:n} \mid x_{-\infty:0})$ denote the number of $y$ in context $s$,

- $c_s(x_{1:n} \mid x_{-\infty:0}) = \sum_y c_s^y(x_{1:n} \mid x_{-\infty:0})$, the number of occurrences of context $s$.

Then:

$$KT_{\mathcal{D}}(x_{1:n} \mid x_{-\infty:0}) = \prod_{s \in \mathcal{D}} \frac{\Gamma(K/2)}{\Gamma(1/2)^K}\,
\frac{\prod_{y \in X} \Gamma(c_s^y(x_{1:n} \mid x_{-\infty:0}) + 1/2)}{\Gamma(c_s(x_{1:n} \mid x_{-\infty:0}) + K/2)}$$

These quantities may be recursively computed by the following lemma (for a binary alphabet):

Lemma 5.2. Let $P_e(a,b)$ denote the K-T estimator for a particular context $s$, $a$ (resp. $b$)
representing the number of 0s (resp. 1s) seen in context $s$. Then $P_e(0,0) = 1$,
and for $a \ge 0$, $b \ge 0$,

$$P_e(a+1, b) = \frac{a + 1/2}{a + b + 1}\, P_e(a,b) \quad\text{and}\quad P_e(a, b+1) = \frac{b + 1/2}{a + b + 1}\, P_e(a,b).$$

The proof will not be presented here, but is quite easy, and may be found in [WST95]. The case
of an alphabet of size $K$ is identical, but much longer to write.
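The recursion of Lemma 5.2 translates directly into code. The following is a minimal sketch for the binary case; the function name `kt` is an assumption, and exact rational arithmetic is used only to make the values easy to check by hand.

```python
from fractions import Fraction


def kt(a: int, b: int) -> Fraction:
    """Binary Krichevsky-Trofimov estimator P_e(a, b), built from P_e(0, 0) = 1."""
    p = Fraction(1)
    for i in range(a):        # P_e(i + 1, 0) = (i + 1/2) / (i + 1) * P_e(i, 0)
        p *= Fraction(2 * i + 1, 2 * (i + 1))
    for j in range(b):        # P_e(a, j + 1) = (j + 1/2) / (a + j + 1) * P_e(a, j)
        p *= Fraction(2 * j + 1, 2 * (a + j + 1))
    return p
```

Since $P_e(a,b)$ is the probability of one particular sequence with $a$ zeros and $b$ ones, the values for all sequences of a given length sum to one, for example $P_e(2,0) + 2 P_e(1,1) + P_e(0,2) = 3/8 + 2/8 + 3/8 = 1$.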

5.1.2 The Context Tree Weighting Method 

When the context tree of our source is unknown, one solution is to mix over all possible context trees
(of a certain maximum depth). The Context Tree Weighting method consists in this idea: if $\pi$ is a
probability on context trees, we take

$$CTW(x_{1:n}) = \sum_{\mathcal{D}} \pi(\mathcal{D})\, KT_{\mathcal{D}}(x_{1:n})$$

Typically, we will take for $\pi$ a branching law, in which every node of depth less than a certain $M$
has probability $\alpha \le 1/K$ of having $K$ daughters.

An important result is that this method is universal:

Theorem 5.1. If $(X_n)_{n \in \mathbb{Z}}$ is ergodic and stationary of law $\mathbb{P}$, then, $\mathbb{P}$-a.s.,

$$\lim_{n \to \infty} -\frac{1}{n} \log CTW(X_{1:n}) = H(\mathbb{P})$$

where $H(\mathbb{P})$ is the entropy of $\mathbb{P}$.

The great advantage of this method is that it may be computed recursively.
We will now quickly present how this can be done.

Let us denote, for each context $s$ and $x \in X$, by $x_s$ the number of $x$ seen in context $s$, and by
$P_e((x_s)_{x \in X}, y)$ the corresponding K-T estimator (that is, the probability of having $y \in X$ in
context $s$).

Definition 5.2. To each node $s$ of the context tree $T$ of depth $M$, we assign a weighted probability
which is defined as

$$P_w^s(y) = \begin{cases} [\dots] & \text{for } l(s) < M \\ P_e((x_s), y) & \text{otherwise} \end{cases}$$

This construction has the expected property, that is, $P_w^s(\cdot) = CTW(\cdot \mid s)$, for the $\alpha = 1/K$ mixture.
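To make the recursive computation concrete, here is a sketch of the standard binary weighting recursion as given in [WST95], $P_w^s = \frac{1}{2} P_e(a_s, b_s) + \frac{1}{2} P_w^{0s} P_w^{1s}$ for internal nodes and $P_w^s = P_e(a_s, b_s)$ at depth $M$. It computes block probabilities rather than the per-symbol form of Definition 5.2, and it is an assumption that this matches the K-ary definition used here for $K = 2$. The names `kt` and `ctw_probability` are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Sequence, Tuple


def kt(a: int, b: int) -> Fraction:
    """Binary Krichevsky-Trofimov estimator P_e(a, b)."""
    p = Fraction(1)
    for i in range(a):
        p *= Fraction(2 * i + 1, 2 * (i + 1))
    for j in range(b):
        p *= Fraction(2 * j + 1, 2 * (a + j + 1))
    return p


def ctw_probability(bits: Sequence[int], past: Sequence[int], depth: int) -> Fraction:
    """Weighted probability of `bits` at the root, given at least `depth` past symbols."""
    assert len(past) >= depth, "the past must fill a context of maximal depth"
    counts: Dict[Tuple[int, ...], List[int]] = {}
    history = list(past)
    for x in bits:
        # Context of x: the `depth` preceding symbols, most recent first.
        context = tuple(reversed(history[len(history) - depth:]))
        for d in range(depth + 1):           # update every node on the context path
            counts.setdefault(context[:d], [0, 0])[x] += 1
        history.append(x)

    def weighted(s: Tuple[int, ...]) -> Fraction:
        a, b = counts.get(s, (0, 0))
        if a + b == 0:                       # unvisited subtree: every estimate is 1
            return Fraction(1)
        if len(s) == depth:                  # leaf of the full context tree
            return kt(a, b)
        return Fraction(1, 2) * kt(a, b) + Fraction(1, 2) * weighted(s + (0,)) * weighted(s + (1,))

    return weighted(())
```

Because the result is a mixture of probability laws, summing `ctw_probability` over all binary strings of a fixed length (with the same past) gives exactly 1, which is a convenient sanity check.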
5.2 Application for the parser

This algorithm may be applied to our parser, to condition the rewriting rules. With this method, we
can condition on whatever we want, provided it appears as a linear string of symbols. An obvious
(and perhaps a bit naive) choice would be to condition on the string of rules already used in the parse.
Another would be to condition on the string of rules descending directly from the root to the expanded
node. For example, take our cats and mouses grammar (2) and the sentence 'Which mouse did
the cat eat' (3). When the parser tries to expand the node giving 'the cat' (at point 'Which mouse
did · the cat eat'), it will for example be in the state (index of the rules indicated in (.))

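[Figure: partial derivation tree rooted in rule (2), with nodes (11), (12), (13) and (14), the lexical
leaves did :: =v +wh c, which :: =n d -wh (19) and mouse :: n (9), and • marking the node to be expanded.]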

and will already have constructed the following string of rules: 2-11-12-14-16-18-19-9-13. It will
condition on the string of rules 2-11-12-14, meaning that:

- it's the subject of a VP with moving object (14),

- it's a past sentence (12),

- it's an interrogative sentence (11 and 2).

Indeed, only the first seems relevant, but using the whole string of rules would condition first on the
fact that it is a past sentence, that a mouse is involved, etc., and would have to go up 6 rules to learn
that it is a subject that is currently being expanded, which seems to be the important information. This
conditioning makes it possible to condition roughly on the successive heads c-commanding the expanded
node (plus movement information, which is difficult to get rid of). Of course, we could want a different
method for lexicalisation rules, where thematic and semantic information would be better, but such
a method is difficult to implement, since it has to ensure that conditioning still gives a proper
distribution.

Three points seem important to make precise:

- First, although the CTW algorithm works on any finite alphabet, it is much more efficient on
binary ones. Ron Begleiter and Ran El-Yaniv discussed a method in [BEY06] to make the
algorithm binary even in the case of a non-binary alphabet, by putting chains of CTWs (i.e.
sorting the rules in a binary tree, and having a CTW algorithm for each branching).

- Second, the distribution, as can be seen in the previous examples, is very sparse: a given context
can be followed by two or three different rules, or even just one, while the alphabet is huge, with 23
rules in the cats and mouses grammar (which is very simple). Of course, a more complete grammar
will induce a lot more variability in the possible rules for a given node, but the number of rules will
grow too. However, the restriction of the rules expanding a given node can be implemented directly
in the structure of the context tree of the CTW, provided we know the grammar in advance,
which is of course the case.

- Finally, it is possible that, although the grammar allows for a choice of rewriting rules for a given
category, there is in fact no such choice (or only with vanishing probability). Then it is possible to use
a slightly different base estimator instead of the Krichevsky-Trofimov one: the zero-redundancy
estimator $P_e^{zr}(a,b)$, defined as:

$$P_e^{zr}(a,b) = \begin{cases}
\frac{1}{2} P_e(a,b) & \text{for } a > 0,\ b > 0 \\
\frac{1}{2} P_e(a,0) + \frac{1}{4} & \text{for } a > 0,\ b = 0 \\
\frac{1}{2} P_e(0,b) + \frac{1}{4} & \text{for } a = 0,\ b > 0 \\
1 & \text{for } a = b = 0
\end{cases}$$

This estimator better recognises sources generating only 0s or only 1s.
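The zero-redundancy estimator can be sketched on top of the K-T recursion of Lemma 5.2. The function names `kt` and `kt_zero_redundancy` are assumptions made for this illustration; the case analysis follows the definition above.

```python
from fractions import Fraction


def kt(a: int, b: int) -> Fraction:
    """Binary Krichevsky-Trofimov estimator P_e(a, b)."""
    p = Fraction(1)
    for i in range(a):
        p *= Fraction(2 * i + 1, 2 * (i + 1))
    for j in range(b):
        p *= Fraction(2 * j + 1, 2 * (a + j + 1))
    return p


def kt_zero_redundancy(a: int, b: int) -> Fraction:
    """Zero-redundancy estimator: extra mass 1/4 on all-0 and on all-1 sequences."""
    if a == 0 and b == 0:
        return Fraction(1)
    bonus = Fraction(1, 4) if (a == 0 or b == 0) else Fraction(0)
    return Fraction(1, 2) * kt(a, b) + bonus
```

For a deterministic context the difference is visible quickly: after ten zeros, `kt(10, 0)` is about 0.18 while `kt_zero_redundancy(10, 0)` is about 0.34, and the latter stays above 1/4 however long the run is.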

Conclusion 

The method described here makes it possible to view Minimalist Grammars as the more 'classical' and,
above all, simpler Linear Context-Free Rewriting Systems (which don't have movement, and generate
sentences top-down), by taking a different point of view: considering derivation trees instead of derived
trees. This enabled us to easily put a probability field on these grammars, and to parse them in a
top-down, incremental way, giving a progressive parse as the words of the sentence are discovered. The
probability field allowed us to implement a beam search in the parser, pruning the hypotheses to keep
only the most likely ones. This should accelerate the parser, while making it fail to identify
'garden-path'-type sentences. The use of more refined probabilistic tools such as the CTW algorithm
gives a better estimate of the real probability field, by conditioning the expanding rules on their
context: here, the nature of the c-commanding heads, as required by current linguistic theories.


